US_Mobility2021.csv contains the 2021 U.S. mobility data from Google's Community Mobility Reports. For each state it reports the daily percent change from baseline in six categories: retail and recreation, grocery and pharmacy, parks, transit stations, workplaces, and residential.
The main data our group examines are statistics on COVID-19 in the U.S. across different states and settings in 2021.
Google mobility data: https://www.google.com/covid19/mobility/
COVID-19 hospitalization data: https://github.com/owid/covid-19-data/blob/master/public/data/hospitalizations/locations.csv
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import plotly.express as px
import seaborn as sns
import plotly.graph_objs as go
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from IPython.display import display, Markdown
import warnings
from plotly.offline import plot
# Load the data
dfcovid = pd.read_csv('US_Covid.csv')
dfcovid['date'] = pd.to_datetime(dfcovid['date'])
dfcovid = dfcovid.sort_values(by='date')
dfcovid.head()
| state | date | critical_staffing_shortage_today_yes | critical_staffing_shortage_today_no | critical_staffing_shortage_today_not_reported | critical_staffing_shortage_anticipated_within_week_yes | critical_staffing_shortage_anticipated_within_week_no | critical_staffing_shortage_anticipated_within_week_not_reported | hospital_onset_covid | hospital_onset_covid_coverage | ... | previous_day_admission_pediatric_covid_confirmed_5_11_coverage | previous_day_admission_pediatric_covid_confirmed_unknown | previous_day_admission_pediatric_covid_confirmed_unknown_coverage | staffed_icu_pediatric_patients_confirmed_covid | staffed_icu_pediatric_patients_confirmed_covid_coverage | staffed_pediatric_icu_bed_occupancy | staffed_pediatric_icu_bed_occupancy_coverage | total_staffed_pediatric_icu_beds | total_staffed_pediatric_icu_beds_coverage | Demo | State Geographic Boundaries | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31596 | LA | 2020-01-01 | 0 | 0 | 1 | 0 | 0 | 1 | NaN | 0 | ... | 0 | NaN | 0 | NaN | 0 | NaN | 0 | NaN | 0 | NaN |
| 30836 | MT | 2020-01-01 | 0 | 0 | 1 | 0 | 0 | 1 | 0.0 | 1 | ... | 0 | NaN | 0 | NaN | 0 | NaN | 0 | NaN | 0 | NaN |
| 129 | NC | 2020-01-01 | 0 | 0 | 1 | 0 | 0 | 1 | 0.0 | 1 | ... | 0 | NaN | 0 | NaN | 0 | NaN | 0 | NaN | 0 | NaN |
| 32307 | PR | 2020-01-01 | 0 | 0 | 1 | 0 | 0 | 1 | 0.0 | 1 | ... | 0 | NaN | 0 | NaN | 0 | NaN | 0 | NaN | 0 | NaN |
| 29601 | MN | 2020-01-01 | 0 | 0 | 1 | 0 | 0 | 1 | 0.0 | 1 | ... | 0 | NaN | 0 | NaN | 0 | NaN | 0 | NaN | 0 | NaN |
5 rows × 136 columns
Our analysis focuses on two key metrics: the total number of adult patients hospitalized with confirmed COVID-19 and the number of deaths.
First, we use choropleth from Plotly Express to animate how the number of adult patients hospitalized with confirmed COVID-19, and the number of deaths, are distributed across U.S. states over time.
# Choropleth map of confirmed adult COVID-19 hospitalizations, animated by date
fig = px.choropleth(dfcovid,
                    locations='state',
                    locationmode='USA-states',
                    color='total_adult_patients_hospitalized_confirmed_covid',
                    hover_name='state',
                    animation_frame='date',
                    title='COVID-19 Confirmed Hospitalizations by State over Time',
                    color_continuous_scale='YlGnBu',
                    scope='usa'
                    )
fig.show()
# The animation is also exported to a separate HTML file called "choropleth_map.html"
#plot(fig, filename='choropleth_map.html', auto_open=False)
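The animation above creates one frame per date, which can make the exported HTML file large. One possible way to lighten it, sketched here on made-up data rather than the real file, is to average each state's values by calendar week and use plain date strings as frame labels. The `demo` frame and the `week` column are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Synthetic stand-in for dfcovid: two states, 14 daily rows each (made-up values).
dates = pd.date_range("2021-01-04", periods=14, freq="D")
col = "total_adult_patients_hospitalized_confirmed_covid"
demo = pd.DataFrame({
    "state": ["NY"] * 14 + ["CA"] * 14,
    "date": list(dates) * 2,
    col: list(range(14)) + list(range(100, 114)),
})

# Average each state's daily values within calendar weeks (weeks end on Sunday).
weekly = (
    demo.set_index("date")
        .groupby("state")[col]
        .resample("W")
        .mean()
        .reset_index()
)

# Plain 'YYYY-MM-DD' strings give readable, chronologically sortable frame labels.
weekly["week"] = weekly["date"].dt.strftime("%Y-%m-%d")
weekly = weekly.sort_values(["week", "state"]).reset_index(drop=True)
print(weekly)
```

The resulting frame could then be passed to `px.choropleth` with `animation_frame='week'` in place of the daily `date` column.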
Then we analyzed the two metrics separately.
First, a line graph shows the overall number of confirmed adult hospitalizations in the U.S. over time.
# This cell raises a FutureWarning that we could not eliminate: we tried the recommended
# np.array conversion and updating the plotting library, but neither worked, so we silence it.
warnings.simplefilter(action='ignore', category=FutureWarning)
# Sum over all states for each date so that the chart shows a single national line
us_total = dfcovid.groupby('date', as_index=False)['total_adult_patients_hospitalized_confirmed_covid'].sum()
# Line chart
fig = px.line(us_total, x='date', y='total_adult_patients_hospitalized_confirmed_covid',
              title='COVID-19 Confirmed Hospitalizations Over Time')
# The chart is also exported to a separate HTML file called "line_chart.html"
fig.show()
#plot(fig, filename='line_chart.html', auto_open=False)
January 2021, August 2021, and early 2022 stand out as COVID-19 surge periods.
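Surge periods can also be located numerically instead of being read off the chart. The sketch below is one way to do so, assuming a daily national series: it smooths the series with a 7-day rolling mean and ranks months by their smoothed peak. The data are synthetic (two artificial bumps in January and August 2021), and the names `national`, `smooth`, and `surge_months` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic national series (made-up values) with bumps in January and August 2021.
days = pd.date_range("2021-01-01", "2021-12-31", freq="D")
t = np.arange(len(days))
values = (100
          + 1000 * np.exp(-((t - 9) / 10) ** 2)     # bump peaking around January 10
          + 600 * np.exp(-((t - 231) / 10) ** 2))   # bump peaking around August 20
national = pd.Series(values, index=days)

# A 7-day rolling mean damps day-to-day and weekday reporting noise.
smooth = national.rolling(window=7).mean()

# Highest smoothed value within each calendar month, then the two largest months.
monthly_peak = smooth.groupby(smooth.index.to_period("M")).max()
surge_months = [str(p) for p in monthly_peak.nlargest(2).index]
print(surge_months)
```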
Second, a pie chart shows the distribution of deaths by state.
# Pie chart: px.pie sums deaths_covid over all rows that share the same state
fig = px.pie(dfcovid,
             names='state',
             values='deaths_covid',
             title='COVID-19 Deaths by State')
fig.update_traces(textinfo='none')
# The chart is also exported to a separate HTML file called "pie_chart.html"
fig.show()
#plot(fig, filename='pie_chart.html', auto_open=False)
The four states with the most deaths are New York, California, Texas and Florida.
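Because the pie chart hides its labels (`textinfo='none'`), a ranking like the one above is easier to verify from a table. A minimal sketch, using a made-up stand-in for `dfcovid` rather than the real data, sums daily deaths per state and keeps the largest totals:

```python
import pandas as pd

# Synthetic stand-in for dfcovid (made-up numbers, not the real data).
demo = pd.DataFrame({
    "state": ["NY", "NY", "CA", "CA", "TX", "TX", "VT", "VT"],
    "deaths_covid": [100.0, 120.0, 90.0, None, 60.0, 70.0, 1.0, 2.0],
})

# Sum daily deaths per state (missing values are skipped) and keep the top three.
top_states = demo.groupby("state")["deaths_covid"].sum().nlargest(3)
print(top_states)
```

On the real frame, the same call with `nlargest(4)` would list the four states with the most recorded deaths.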
Based on the statistics above, we chose to focus on New York, one of the states with the most deaths and therefore a representative case for the pandemic.
dfcovid['date'] = pd.to_datetime(dfcovid['date'])
# Keep only the New York rows from 2021
dfcovid = dfcovid[(dfcovid['state'] == 'NY') &
                  (dfcovid['date'] >= '2021-01-01') &
                  (dfcovid['date'] <= '2021-12-31')]
# Keep the state, the date, and the two metrics of interest
dfcovid = dfcovid[['state', 'date', 'total_adult_patients_hospitalized_confirmed_covid', 'deaths_covid']]
# Index by date and sort chronologically
dfcovid = dfcovid.set_index('date').sort_index()
# Metric-only frame, used later when merging with the mobility data
dfcovid_filter = dfcovid[['total_adult_patients_hospitalized_confirmed_covid', 'deaths_covid']]
dfcovid.head()
| date | state | total_adult_patients_hospitalized_confirmed_covid | deaths_covid |
|---|---|---|---|
| 2021-01-01 | NY | 7926.0 | 103.0 |
| 2021-01-02 | NY | 8099.0 | 114.0 |
| 2021-01-03 | NY | 8406.0 | 142.0 |
| 2021-01-04 | NY | 8636.0 | 115.0 |
| 2021-01-05 | NY | 8723.0 | 131.0 |
## Calculate and visualize the correlation matrix
# Import US_Mobility data
dfmobi = pd.read_csv('US_Mobility.csv')
dfmobi.head()
selected_columns = ['place_id', 'date',
'retail_and_recreation_percent_change_from_baseline',
'grocery_and_pharmacy_percent_change_from_baseline',
'parks_percent_change_from_baseline',
'transit_stations_percent_change_from_baseline',
'workplaces_percent_change_from_baseline',
'residential_percent_change_from_baseline']
dfmobi['date'] = pd.to_datetime(dfmobi['date'])
dfmobi = dfmobi[selected_columns]
# Keep the 2021 rows for the single place_id used in this analysis
dfmobi = dfmobi[(dfmobi['place_id'] == 'ChIJqaUj8fBLzEwRZ5UY3sHGz90') &
                (dfmobi['date'] >= '2021-01-01') &
                (dfmobi['date'] <= '2021-12-31')]
dfmobi = dfmobi.set_index('date')
# Place the mobility and COVID columns side by side, aligned on the date index
df_merged = pd.concat([dfmobi, dfcovid_filter], axis=1)
# Shorter column names, listed in the same order as the concatenated columns
df_merged.columns = ['Place_ID', 'Retail_Recreation', 'Grocery_Pharmacy', 'Parks',
                     'Transit_Stations', 'Workplaces', 'Residential',
                     'Adult_hospitalized_confirmed', 'Deaths']
# Initializing MinMaxScaler
scaler = MinMaxScaler()
# Select the columns to be normalized (except the 'Place_ID' column)
cols_to_normalize = df_merged.columns.difference(['Place_ID'])
# Normalize these columns
df_merged[cols_to_normalize] = scaler.fit_transform(df_merged[cols_to_normalize])
# Calculate the correlation coefficient matrix
correlation_matrix = df_merged.drop(['Place_ID'], axis=1).corr()
# Heatmap plotting
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
correlation_matrix
| | Retail_Recreation | Grocery_Pharmacy | Parks | Transit_Stations | Workplaces | Residential | Adult_hospitalized_confirmed | Deaths |
|---|---|---|---|---|---|---|---|---|
| Retail_Recreation | 1.000000 | 0.880804 | 0.715109 | 0.818977 | 0.365272 | -0.695841 | -0.786249 | -0.783628 |
| Grocery_Pharmacy | 0.880804 | 1.000000 | 0.514522 | 0.647740 | 0.189574 | -0.474913 | -0.607726 | -0.606970 |
| Parks | 0.715109 | 0.514522 | 1.000000 | 0.700282 | 0.322325 | -0.736802 | -0.730443 | -0.722042 |
| Transit_Stations | 0.818977 | 0.647740 | 0.700282 | 1.000000 | 0.678384 | -0.895651 | -0.710460 | -0.680441 |
| Workplaces | 0.365272 | 0.189574 | 0.322325 | 0.678384 | 1.000000 | -0.819535 | -0.207193 | -0.173417 |
| Residential | -0.695841 | -0.474913 | -0.736802 | -0.895651 | -0.819535 | 1.000000 | 0.629891 | 0.603296 |
| Adult_hospitalized_confirmed | -0.786249 | -0.607726 | -0.730443 | -0.710460 | -0.207193 | 0.629891 | 1.000000 | 0.965070 |
| Deaths | -0.783628 | -0.606970 | -0.722042 | -0.680441 | -0.173417 | 0.603296 | 0.965070 | 1.000000 |
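A note on the normalization step: the Pearson coefficient that `DataFrame.corr()` computes by default does not change under min-max scaling, so the scaling matters for plotting the series on a common axis rather than for the matrix itself. The sketch below checks this on synthetic data; the column names and values are made up, and the scaling is written out by hand as the [0, 1] mapping that `MinMaxScaler` applies by default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins (made-up values) for one mobility column and one COVID column.
mobility = rng.normal(-20, 10, size=200)
hospitalized = 5000 - 80 * mobility + rng.normal(0, 400, size=200)
raw = pd.DataFrame({"mobility": mobility, "hospitalized": hospitalized})

# Min-max scaling of each column to [0, 1]: (x - min) / (max - min).
scaled = (raw - raw.min()) / (raw.max() - raw.min())

# Pearson r before and after scaling.
r_raw = raw["mobility"].corr(raw["hospitalized"])
r_scaled = scaled["mobility"].corr(scaled["hospitalized"])
print(round(r_raw, 6), round(r_scaled, 6))
```

Both values agree up to floating-point error, because min-max scaling is a positive linear transformation of each column.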
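The correlation heatmap shows the pairwise linear (Pearson) correlations among the Google mobility indicators and the two COVID-19 metrics. The values in the matrix range from -1 to 1, where 1 is a perfect positive linear relationship, -1 is a perfect negative linear relationship, and 0 means no linear relationship.
The following analysis is based on these values: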
Retail Recreation correlates strongly with Grocery Pharmacy (0.88), suggesting that retail and recreation activity moves in step with grocery and pharmacy visits.
Retail Recreation also correlates strongly with Parks (0.72) and with Transit Stations (0.82), so outdoor park activity and transit use tend to rise and fall together with retail and recreation activity.
Its correlation with Workplaces (0.37) is much weaker, indicating a positive but less pronounced association.
Workplaces represents the level of activity at places of work, such as offices, factories, and commercial areas.
Correlation with Retail Recreation: 0.365272, a weak-to-moderate positive correlation with retail and recreation activity.
Correlation with Transit Stations: 0.678384, a moderate positive correlation with activity near transit stations.
Correlation with Residential: -0.819535, a strong negative correlation with time spent at residential places.
Retail Recreation and Grocery Pharmacy: "Retail Recreation" (-0.79) shows a strong negative correlation and "Grocery Pharmacy" (-0.61) a moderate negative correlation with "Adult hospitalized confirmed". Days with more retail and grocery/pharmacy activity tend to have fewer confirmed adult hospitalizations.
Parks: "Parks" (-0.73) also shows a strong negative correlation with "Adult hospitalized confirmed", so more park activity is associated with fewer confirmed adult hospitalizations.
Transit Stations: "Transit Stations" (-0.71) has a strong negative correlation with "Adult hospitalized confirmed", so more activity around transit stations is associated with fewer confirmed adult hospitalizations.
Workplaces: "Workplaces" (-0.21) has only a weak negative correlation with "Adult hospitalized confirmed"; workplace activity is barely associated with the number of confirmed adult hospitalizations.
Residential: "Residential" (0.63) shows a moderate positive correlation with "Adult hospitalized confirmed", so more time spent at home is associated with more confirmed adult hospitalizations.
Retail Recreation and Grocery Pharmacy: As with "Adult hospitalized confirmed", "Retail Recreation" (-0.78) shows a strong negative correlation and "Grocery Pharmacy" (-0.61) a moderate negative correlation with "Deaths". Days with more retail and grocery/pharmacy activity tend to have fewer deaths.
Parks: "Parks" (-0.72) also shows a strong negative correlation with "Deaths", so more park activity is associated with fewer deaths.
Transit Stations: "Transit Stations" (-0.68) has a moderate negative correlation with "Deaths", just below the 0.7 threshold used below, so more activity around transit stations is associated with fewer deaths.
Workplaces: "Workplaces" (-0.17) has only a weak negative correlation with "Deaths"; workplace activity is barely associated with the number of deaths.
Residential: "Residential" (0.60) shows a moderate positive correlation with "Deaths", so more time spent at home is associated with more deaths.
# Determine elements of strong correlation
def check_strong_correlation(correlation_dict, threshold=0.7):
    strong_correlations = []
    for i, correlation_coefficient in correlation_dict.items():
        # Keep |r| >= threshold, skipping a variable's correlation with itself (exactly 1.0)
        if abs(correlation_coefficient) >= threshold and correlation_coefficient != 1.0:
            strong_correlations.append((i, correlation_coefficient))
    return strong_correlations
correlation_confirmed = correlation_matrix['Adult_hospitalized_confirmed']
strong_confirmed = check_strong_correlation(correlation_confirmed)
correlation_deaths = correlation_matrix['Deaths']
strong_deaths = check_strong_correlation(correlation_deaths)
print('The high correlation between confirmed number and other indicators:', strong_confirmed)
print('The high correlation between deaths number and other indicators:', strong_deaths)
The high correlation between confirmed number and other indicators: [('Retail_Recreation', -0.7862485622651575), ('Parks', -0.730442635152799), ('Transit_Stations', -0.7104600729971495), ('Deaths', 0.9650701771452177)]
The high correlation between deaths number and other indicators: [('Retail_Recreation', -0.7836276981167368), ('Parks', -0.7220418391693083), ('Adult_hospitalized_confirmed', 0.9650701771452177)]
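Mobility indicators strongly correlated (absolute value of at least 0.7) with confirmed adult hospitalizations:
Retail Recreation: -0.79
Parks: -0.73
Transit Stations: -0.71
Mobility indicators strongly correlated (absolute value of at least 0.7) with deaths:
Retail Recreation: -0.78
Parks: -0.72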
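The same threshold filter can be written without an explicit loop. The sketch below is an alternative, not the notebook's own implementation: it uses a boolean mask on a pandas Series and drops the variable's own entry by name instead of testing for a coefficient of exactly 1.0. The helper name `strong_correlations` is hypothetical, and the input values are copied from the `Adult_hospitalized_confirmed` column of the matrix above.

```python
import pandas as pd

# Correlations with Adult_hospitalized_confirmed, copied from the matrix above.
corr_confirmed = pd.Series({
    "Retail_Recreation": -0.786249,
    "Grocery_Pharmacy": -0.607726,
    "Parks": -0.730443,
    "Transit_Stations": -0.710460,
    "Workplaces": -0.207193,
    "Residential": 0.629891,
    "Adult_hospitalized_confirmed": 1.0,
    "Deaths": 0.965070,
}, name="Adult_hospitalized_confirmed")


def strong_correlations(column, threshold=0.7):
    """Hypothetical helper: entries with |r| >= threshold, excluding the column itself."""
    others = column.drop(labels=column.name, errors="ignore")
    return others[others.abs() >= threshold]


strong = strong_correlations(corr_confirmed)
print(strong)
```

Dropping the self-correlation by label avoids relying on an exact floating-point comparison with 1.0.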
# Plot trends for each data column separately
# Note: this definition replaces the plot function imported from plotly.offline above
def plot(strong_list, df_merged):
    data_columns = [item[0] for item in strong_list]
    # Choose the reference series according to which list of strong correlations was passed
    if strong_list == strong_confirmed:
        reference_column = 'Adult_hospitalized_confirmed'
        reference_label = 'Adult Hospitalized (Confirmed Cases)'
    elif strong_list == strong_deaths:
        reference_column = 'Deaths'
        reference_label = 'Deaths'
    else:
        raise ValueError("Invalid strong_list provided.")
    # One figure per strongly correlated column, drawn against the reference series
    for column in data_columns:
        plt.figure(figsize=(12, 4))
        plt.plot(df_merged.index, df_merged[column], label=column)
        plt.plot(df_merged.index, df_merged[reference_column], label=reference_label, linestyle='--')
        plt.xlabel('Date')
        plt.ylabel('Normalized Value')
        plt.title(f'Trend of {column} with {reference_label}')
        plt.legend()
        plt.show()
plot(strong_confirmed, df_merged)
The graphs show the relationship between mobility trends in New York State and the number of adult COVID-19 hospitalizations over 2021. Each graph plots the normalized values of one mobility indicator or outcome against the normalized values of confirmed adult hospitalizations.
Retail and Recreation with Adult Hospitalized (Confirmed Cases): In this graph, the blue line representing retail and recreation activity fluctuates throughout the year but generally exhibits declines during periods where the orange dashed line, representing adult hospitalizations, peaks. This suggests that when COVID-19 hospitalizations increased, activities at retail and recreation locations decreased, possibly due to lockdown measures, restrictions, or voluntary changes in public behavior to reduce the risk of transmission.
Parks with Adult Hospitalized (Confirmed Cases): Park activity does not show as clear an inverse pattern with hospitalizations as retail and recreation does. In some periods when hospitalizations increased, park visits also increased, possibly because outdoor spaces were seen as safer alternatives for leisure and exercise, especially when indoor venues were restricted. Over the whole year, however, the correlation is -0.73, so the two series are inversely correlated overall.
Transit Stations with Adult Hospitalized (Confirmed Cases): Similar to retail and recreation, the use of transit stations also generally decreases as hospitalizations increase. This could be due to reduced commuting because of remote work policies, lockdowns, or a public preference to avoid crowded places such as transit hubs to lower the risk of contracting the virus.
Deaths with Adult Hospitalized (Confirmed Cases): There's a visible correlation between the trends in deaths and adult hospitalizations, with both metrics rising and falling in tandem. This suggests a direct relationship between the severity of the COVID-19 outbreak and mortality rates. Increases in hospitalization are mirrored by increases in deaths, indicating the periods when the healthcare system was under the most strain.
plot(strong_deaths, df_merged)
These three graphs compare different public activities and outcomes with COVID-19 deaths in New York State over 2021.
Trend of Retail_Recreation with Deaths: This graph illustrates the relationship between the normalized value of visits to retail and recreation venues (solid blue line) and the normalized value of COVID-19 related deaths (dashed orange line). There appears to be an inverse relationship in some portions of the graph where deaths peak, particularly in the earlier months, and retail and recreation activity shows some decline. However, the correlation isn't strictly inverse; there are periods, especially towards the end of the year, where both metrics rise simultaneously, which could indicate changing public behavior or the implementation of new health and safety protocols allowing for increased retail activity even as deaths rise.
Trend of Parks with Deaths: In the second graph, the activity in parks (solid blue line) seems less directly correlated with the death rate (dashed orange line). Parks usage shows high variability and does not consistently decline with peaks in the death rate, which suggests that outdoor activities might have been perceived as less risky or that people continued to visit parks for recreation despite the fluctuating death rate due to the pandemic.
Trend of Adult_hospitalized_confirmed with Deaths: The third graph shows a very close correlation between the trends in adult hospitalizations for confirmed cases (solid blue line) and deaths (dashed orange line). This is to be expected as more severe cases that lead to hospitalization could subsequently result in higher mortality. The similarity in trends underscores the direct impact of COVID-19 on health outcomes, with deaths lagging slightly behind hospitalizations, as would occur naturally in the progression of the disease.
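The observation that deaths lag slightly behind hospitalizations could be checked with a lagged correlation. The sketch below is one way to do that, under the assumption of daily series on a shared date index: it shifts the leading series forward by k days and looks for the shift with the highest Pearson correlation. The data are synthetic, with a lag of 5 days built in on purpose, and the names `lagged_correlation` and `best_lag` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (made-up values): deaths follow hospitalizations 5 days later.
days = pd.date_range("2021-01-01", periods=120, freq="D")
t = np.arange(120)
hosp = pd.Series(1000 + 500 * np.sin(2 * np.pi * t / 60), index=days)
true_lag = 5
deaths = 0.02 * hosp.shift(true_lag)


def lagged_correlation(leading, following, max_lag=14):
    """Pearson r between `leading` shifted forward by k days and `following`, k = 0..max_lag."""
    return pd.Series({k: leading.shift(k).corr(following) for k in range(max_lag + 1)})


lag_corr = lagged_correlation(hosp, deaths)
best_lag = int(lag_corr.idxmax())
print(best_lag)
```

Applied to the real data, the arguments would be `df_merged['Adult_hospitalized_confirmed']` and `df_merged['Deaths']`; the lag with the highest correlation is only a rough estimate of the typical delay.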
Be specific. Some of the tasks can be coding (expect everyone to do this), background research, conceptualisation, visualisation, data analysis, data modelling
Max, Li: Part of the background research and research objectives. Coding for data import and processing. Data analysis of the highly correlated mobility and COVID-19 infection data. Notebook structure and final consistency check; exporting the animations as separate HTML files, as they cannot be displayed in the original HTML file.
Mingyan, Jin: Imported and analyzed 'US_Covid.csv' and 'US_Mobility2021.csv', using different charts (animation, pie chart, and histogram) to visualize the severity of COVID-19 across U.S. states, and on that basis selected New York for the subsequent correlation analysis.
Heran, Zhao: Objectives of data correlation analysis, conducting correlation analysis, creating and analyzing the correlation heatmap, identifying strong correlations, summarizing the strong correlations results.